
In March 2017, the population of France officially reached 67 million people. As in many countries over the last few decades, the French population has tended to concentrate around a limited set of big cities which, as a consequence, host the highest levels of economic activity in the country.
Paris is clearly, by far, the densest area of the French territory (accounting for up to 19% of the French population), followed by Lyon and Marseille in second and third position. With populations of approximately one million inhabitants, the metropolitan areas ranked 4th to 10th can all be considered, at the French scale, examples of mid-size cities.
French people are usually well aware of the specificities of Paris, Lyon and Marseille. These cities receive strong national attention through the news, cinema, national sporting events (e.g. soccer) and even the history and geography lessons taught at school.
Mid-size cities receive less attention, and French people would probably have more difficulty identifying differences or similarities between cities like Toulouse or Nantes, for example.
As we did with New York and Toronto, my idea is to perform data exploration on a set of French mid-size cities and to determine whether the resulting segmentation and clustering reveals similarities or specificities.
Data exploration will be performed for the cities of Toulouse, Bordeaux and Nantes. Apart from the size criterion, the selection of these cities is also driven by the availability of suitable location datasets on open platforms (details are provided in the next chapter). As a complement to the venue data already used in the labs, the exploration will include economic data (namely house prices).
The study will be limited to the administrative boundaries of the cities: suburbs will not be taken into account.
Before we get the data and start exploring it, let's download all the dependencies that we will need.
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
!conda install -c conda-forge geopy --yes # install geopy (skip this cell if geopy is already available)
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
print('Libraries imported.')
!pip install wget
import wget
Thanks to an open data platform managed by the French government, nationwide house sale prices for 2018 are freely available at https://www.data.gouv.fr
## Getting access to list of houses sold in 2018 with addresses and prices
print('2018 French Houses Prices --> Beginning file download with wget module')
url = 'https://www.data.gouv.fr/fr/datasets/r/1be77ca5-dc1b-4e50-af2b-0240147e0346'
wget.download(url, 'houses_data.csv')
From the same platform, a database containing the latitude/longitude coordinates of 16 million French streets is accessible. The data is stored by French "department" (the French territory is divided into a set of administrative departments): Toulouse belongs to department "31", Bordeaux to "33" and Nantes to "44".
## Department 31 - Toulouse
!wget -O 31_streets.csv http://bano.openstreetmap.fr/data/bano-31.csv
## Department 33 - Bordeaux
!wget -O 33_streets.csv http://bano.openstreetmap.fr/data/bano-33.csv
## Department 44 - Nantes
!wget -O 44_streets.csv http://bano.openstreetmap.fr/data/bano-44.csv
import pandas as pd
## Loading data into a Pandas DF
houses_data = pd.read_csv('houses_data.csv', sep='|')
In this section we will build 3 distinct datasets (one per city), with the objective of obtaining, in the end, the average house price per street.
It appears that some information in the dataframe is useless: many columns can be dropped.
### Dropping columns with irrelevant data for the project
houses_data=houses_data[["Valeur fonciere", "Voie","Code postal","Commune","Surface reelle bati"]]
## Building Data sets for each city
Toulouse_houses = houses_data[houses_data["Commune"]=="TOULOUSE"]
Bordeaux_houses = houses_data[houses_data["Commune"]=="BORDEAUX"]
Nantes_houses = houses_data[houses_data["Commune"]=="NANTES"]
Toulouse_houses.head()
Defining a function to clean the data: dropping NaNs, deleting records where surface = 0, and computing the average house price for each street.
def Houses_data_clean(df):
    ## Dropping NaN
    df = df.dropna()
    ## Removing records where surface is = 0
    df = df[df["Surface reelle bati"] != 0]
    ## Converting house prices (Valeur fonciere) to float
    df['Valeur fonciere'] = df['Valeur fonciere'].replace(',', '.', regex=True).astype(float)
    ## Creating a new column showing price / m2
    df["Price_m2"] = df["Valeur fonciere"] / df["Surface reelle bati"]
    ## Average house price per street (Voie) calculation
    df = df.groupby(['Voie', 'Commune'])["Price_m2"].mean().reset_index()
    ## Returning a simple DF with street, city and average price per m2
    return df
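As a quick illustration of what this function does, here is a minimal sketch on hypothetical toy records (street names and prices below are made up): the French decimal comma in "Valeur fonciere" is replaced by a dot before the float conversion, and prices per m² are then averaged per street.

```python
import pandas as pd

# Hypothetical toy records mimicking the raw file (French decimal commas)
toy = pd.DataFrame({
    'Valeur fonciere': ['100000,00', '300000,50'],   # sale prices as strings
    'Surface reelle bati': [50, 100],                # built surface in m2
    'Voie': ['RUE EXEMPLE', 'RUE EXEMPLE'],          # same street twice
    'Commune': ['TOULOUSE', 'TOULOUSE'],
})

# Comma-to-dot conversion, then price per m2
toy['Valeur fonciere'] = toy['Valeur fonciere'].replace(',', '.', regex=True).astype(float)
toy['Price_m2'] = toy['Valeur fonciere'] / toy['Surface reelle bati']

# Average price per street: the two sales on RUE EXEMPLE collapse into one row
avg = toy.groupby(['Voie', 'Commune'])['Price_m2'].mean().reset_index()
print(avg)
```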
Cleaning the data and creating an additional column: average price per square metre.
Toulouse_houses = Houses_data_clean(Toulouse_houses)
Nantes_houses = Houses_data_clean(Nantes_houses)
Bordeaux_houses = Houses_data_clean(Bordeaux_houses)
Nantes_houses.head()
Prices per m² above 10,000 EUR are spurious values: no one would pay 200,000 EUR for a single square metre in Toulouse! Let's remove values above 10,000 EUR/m².
Toulouse_houses=Toulouse_houses[Toulouse_houses["Price_m2"] <= 10000]
Toulouse_houses["Price_m2"].hist(bins=10, alpha=0.5)
OK, that looks good.
Nantes_houses["Price_m2"].describe()
As for Toulouse, prices per m² above 10,000 EUR are spurious values: no one would pay that much for a single square metre in Nantes! Let's remove values above 10,000 EUR/m².
Nantes_houses=Nantes_houses[Nantes_houses["Price_m2"] <= 10000]
Nantes_houses["Price_m2"].hist(bins=10, alpha=0.5)
Bordeaux_houses["Price_m2"].describe()
Bordeaux appears to be a more expensive city than the other two. This time, let's remove values above 20,000 EUR/m².
Bordeaux_houses=Bordeaux_houses[Bordeaux_houses["Price_m2"] <= 20000]
Bordeaux_houses["Price_m2"].hist(bins=10, alpha=0.5)
Defining a function to segment house prices relative to the median price.
def Houses_Prices_seg(df):
    ## Median price calculation
    df_median = df["Price_m2"].median()
    print('Median price : ', df_median)
    ## Segmenting city prices into 5 categories : Low, Medium Low, Medium, Medium High and High
    df["Houses_L"] = df["Price_m2"] <= (0.4 * df_median)
    df["Houses_ML"] = (df["Price_m2"] > (0.4 * df_median)) & (df["Price_m2"] <= (0.8 * df_median))
    df["Houses_M"] = (df["Price_m2"] > (0.8 * df_median)) & (df["Price_m2"] <= (1.2 * df_median))
    df["Houses_MH"] = (df["Price_m2"] > (1.2 * df_median)) & (df["Price_m2"] <= (1.8 * df_median))
    df["Houses_H"] = df["Price_m2"] > (1.8 * df_median)
    ## Converting the boolean results to float
    for col in ["Houses_L", "Houses_ML", "Houses_M", "Houses_MH", "Houses_H"]:
        df[col] = df[col].astype(float)
    return df
Bordeaux is by far more expensive than the other two cities. In order to compare the cities on a like-for-like basis, we classify house prices into categories defined relative to the local median price: Low <= 0.4 × median, Medium Low 0.4 to 0.8 × median, Medium 0.8 to 1.2 × median, Medium High 1.2 to 1.8 × median, High above 1.8 × median.
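The same median-relative banding can be sketched with `pandas.cut` on toy prices (the values below are hypothetical); the bin edges are exactly the median multiples listed above, with right-inclusive intervals matching the `<=` comparisons in the segmentation function.

```python
import numpy as np
import pandas as pd

prices = pd.Series([1000.0, 2000.0, 2500.0, 3500.0, 6000.0])  # hypothetical EUR/m2 values
median = prices.median()  # 2500.0

# Bin edges expressed as multiples of the median
bins = [0, 0.4 * median, 0.8 * median, 1.2 * median, 1.8 * median, np.inf]
labels = ['L', 'ML', 'M', 'MH', 'H']
bands = pd.cut(prices, bins=bins, labels=labels)
print(list(bands))  # one band per toy price
```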
Toulouse_houses = Houses_Prices_seg(Toulouse_houses)
Nantes_houses = Houses_Prices_seg(Nantes_houses)
Bordeaux_houses = Houses_Prices_seg(Bordeaux_houses)
Checking the final dataframe structure (example given for one city).
Bordeaux_houses.head()
This looks fine; we can now move on to the next step!
import pandas as pd
import unicodedata
!pip install Unidecode
from unidecode import unidecode
## French department data upload to dataframes
data_31 = pd.read_csv('31_streets.csv', sep=',', encoding='utf-8')
data_33 = pd.read_csv('33_streets.csv', sep=',', encoding='utf-8')
data_44 = pd.read_csv('44_streets.csv', sep=',', encoding='utf-8')
It appears that some information in the dataframes is useless: several columns can be dropped. It also appears that the column headers aren't properly named (the CSV files have no header row, so pandas read the first data record as column names), so let's assign explicit names to them. Dropping duplicates: to simplify our approach, street numbers weren't taken into account; as a consequence, duplicate addresses will be removed from the dataframes so as to keep a single latitude/longitude pair per street name.
## Defining a Function to Clean Department Data
def clean_dept_data(df, city_name):
    ## Converting the Address and City columns to uppercase
    df["City"] = df["City"].str.upper()
    df["Address"] = df["Address"].str.upper()
    ## Stripping accents (French)
    df["Address"] = df['Address'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
    ## Dropping NaN
    df = df.dropna()
    ## Keeping only data related to the city
    df = df[df["City"] == city_name].reset_index(drop=True)
    ## Dropping duplicates
    df.drop_duplicates(subset='Address', keep='first', inplace=True)
    df.reset_index(drop=True, inplace=True)
    return df
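The accent-stripping line deserves a quick illustration: NFKD normalization splits each accented character into a base letter plus a combining mark, and the ASCII encode/decode round trip drops the mark. A minimal sketch on made-up street names:

```python
import pandas as pd

streets = pd.Series(["Rue de l'Évêché", "Allée Jean-Jaurès"])  # hypothetical names
cleaned = (streets.str.upper()
                  .str.normalize('NFKD')              # É -> E + combining accent
                  .str.encode('ascii', errors='ignore')  # drop the combining accents
                  .str.decode('utf-8'))
print(list(cleaned))
```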
Cleaning the location data collected for the 3 cities.
### TOULOUSE - Dropping columns with irrelevant data for the project
data_31=data_31[["Chemin des Acacias Vc 13", "31230", "Agassac", "43.371259", "0.889303"]]
data_31=data_31.rename(columns = {"Chemin des Acacias Vc 13" : "Address", "31230" : "Zip_Code", "Agassac" : "City", "43.371259" : "Latitude", "0.889303" : "Longitude"})
## Cleaning Data
toulouse_streets=clean_dept_data(data_31, "TOULOUSE")
### BORDEAUX - Dropping columns with irrelevant data for the project
data_33=data_33[["Rue Avitiacus", "33230", "Abzac", "45.014508", "-0.126840"]]
data_33=data_33.rename(columns = {"Rue Avitiacus" : "Address", "33230" : "Zip_Code", "Abzac" : "City", "45.014508" : "Latitude", "-0.126840" : "Longitude"})
## Cleaning Data
bordeaux_streets=clean_dept_data(data_33, "BORDEAUX")
### NANTES - Dropping columns with irrelevant data for the project
data_44=data_44[["Impasse de la Barre", "44170", "Abbaretz", "47.553330", "-1.530346"]]
data_44=data_44.rename(columns = {"Impasse de la Barre" : "Address", "44170" : "Zip_Code", "Abbaretz" : "City", "47.553330" : "Latitude", "-1.530346" : "Longitude"})
## Cleaning Data
nantes_streets=clean_dept_data(data_44, "NANTES")
In this section we will merge the data recorded in the house price dataframes with the data stored in the address dataframes, so that we obtain a single dataframe per city including house price, address and the corresponding latitude & longitude.
import pandas as pd
import numpy as np
Creating a function which adds latitude & longitude to a houses dataframe, using the data collected in the streets dataframe.
def houses_lat_long(houses_df, street_df):
    ## Adding new Latitude / Longitude columns to the houses df
    houses_df["Latitude"] = np.nan
    houses_df["Longitude"] = np.nan
    for ind in houses_df.index:
        houses_test = houses_df['Voie'][ind]
        ## Trying to find the street name in the streets dataframe
        street_ind = street_df['Address'][street_df['Address'].str.find(houses_test) != -1]
        ## Checking that street_ind is not empty
        if not street_ind.empty:
            ## Assigning the latitude / longitude collected from the streets DF to the houses DF
            ## (.loc avoids pandas chained-assignment warnings)
            houses_df.loc[ind, 'Latitude'] = street_df.loc[street_ind.index[0], 'Latitude']
            houses_df.loc[ind, 'Longitude'] = street_df.loc[street_ind.index[0], 'Longitude']
    ## Dropping rows for which no match was found (the df is modified in place)
    houses_df.dropna(inplace=True)
    return
## Applying transformation to all cities
houses_lat_long(Toulouse_houses,toulouse_streets)
houses_lat_long(Nantes_houses,nantes_streets)
houses_lat_long(Bordeaux_houses,bordeaux_streets)
Nantes_houses.head()
Good! Now, for each city, we have a dataframe with segmented house prices per street plus the corresponding latitude/longitude. Let's move on to the next step!
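A note on performance: the row-by-row substring search above is easy to follow but slow on large frames. When street names match exactly, a vectorized pandas merge achieves the same join in one call; a sketch on toy data (all names and coordinates are hypothetical):

```python
import pandas as pd

houses = pd.DataFrame({'Voie': ['RUE A', 'RUE B', 'RUE C'],
                       'Price_m2': [2000.0, 2500.0, 3000.0]})
streets = pd.DataFrame({'Address': ['RUE A', 'RUE B'],
                        'Latitude': [43.6, 43.7],
                        'Longitude': [1.4, 1.5]})

# Left join on exact street names; unmatched rows get NaN and are then dropped
merged = houses.merge(streets, left_on='Voie', right_on='Address', how='left')
merged = merged.dropna().drop(columns='Address')
print(merged)
```

The substring matching of the original loop is more tolerant of naming differences between the two sources, which is why it was kept in the project; the merge variant only works when both datasets spell street names identically.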
Let's visualize the prices on a map of each city. First, let's define a function to simplify the plotting.
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
!conda install -c conda-forge geopy --yes # install geopy (skip this cell if geopy is already available)
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
print('Libraries imported.')
def city_plot(city_name, df, zoom_level=12):
    ## Defining a colour code (1 = Low ... 5 = High) and the matching colour dictionary
    df_color = df["Houses_L"] + df["Houses_ML"]*2 + df["Houses_M"]*3 + df["Houses_MH"]*4 + df["Houses_H"]*5
    color_dict = {1.0: 'blue', 2.0: 'yellow', 3.0: 'orange', 4.0: 'red', 5.0: 'black'}
    ## First, create a geolocator object
    geolocator = Nominatim(user_agent="explorer")
    location = geolocator.geocode(city_name)
    latitude = location.latitude
    longitude = location.longitude
    print('The geographical coordinates of the city are {}, {}.'.format(latitude, longitude))
    # create a map of the city using the latitude and longitude values
    map_city = folium.Map(location=[latitude, longitude], zoom_start=zoom_level)
    # add markers to the map
    for lat, lng, street, price in zip(df['Latitude'], df['Longitude'], df['Voie'], df_color):
        label = folium.Popup('{}, {}'.format(street, price), parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=3,
            popup=label,
            color=color_dict[price],
            fill=True,
            fill_color=color_dict[price],
            fill_opacity=0.5).add_to(map_city)
    display(map_city)
    return
Thanks to this new data, I used the Folium library to visualize how house prices are distributed across the city maps. I applied a colour dictionary to distinguish the house price categories: blue for Low, yellow for Medium Low, orange for Medium, red for Medium High and black for High.
city_plot("Toulouse", Toulouse_houses, zoom_level=12)
city_plot("Nantes", Nantes_houses, zoom_level=12)
city_plot("Bordeaux", Bordeaux_houses, zoom_level=12)
CLIENT_ID = '---' # your Foursquare ID - Note hidden credentials
CLIENT_SECRET = '---' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe
import json # library to handle JSON files
## Resetting indexes (in place, otherwise the call has no effect)
Toulouse_houses.reset_index(drop=True, inplace=True)
Nantes_houses.reset_index(drop=True, inplace=True)
Bordeaux_houses.reset_index(drop=True, inplace=True)
Let's reuse the function defined during the labs.
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=20):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
import time
In order to limit the number of requests sent to the Foursquare API in a row, the data will be processed in chunks of 50 rows.
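The chunking logic used below can be illustrated on a toy dataframe (the chunk size is reduced to 3 here). When the length is not a multiple of the chunk size, the last chunk is simply shorter; when it is an exact multiple, the trailing slice is empty and should be discarded.

```python
import pandas as pd

df = pd.DataFrame({'x': range(7)})
S = 3  # chunk size (50 in the real function)
frames = [df.iloc[i*S:(i+1)*S] for i in range(int(len(df)/S) + 1)]
frames = [f for f in frames if not f.empty]  # drop a possible empty trailing slice
print([len(f) for f in frames])
```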
## This function segments the houses dataframe into chunks of S rows and then calls the Foursquare API
def Foursquare_seg(df, city):
    import time
    ## Initialisation : chunk size
    S = 50
    frames = [df.iloc[i*S:(i+1)*S].copy() for i in range(int(len(df)/S) + 1)]
    frames = [f for f in frames if not f.empty]  # drop a possible empty trailing chunk
    res_df_venues = None
    for n, frame in enumerate(frames):
        print(n)
        nom = frame['Voie']
        lati = frame['Latitude']
        longi = frame['Longitude']
        ## Getting the venues for the current chunk
        df_venues = getNearbyVenues(names=nom,
                                    latitudes=lati,
                                    longitudes=longi)
        ## Writing the chunk result to a file
        df_venues.to_csv(r'_' + city + '_venues_0' + str(n) + '.csv')
        ## Appending the venues to a global df
        if res_df_venues is None:
            res_df_venues = df_venues
        else:
            res_df_venues = pd.concat([res_df_venues, df_venues], ignore_index=True)
        # Wait for 5 seconds between chunks to stay below the API rate limit
        time.sleep(5)
        print('====== WAIT 5 SEC =======')
    return res_df_venues
nantes_venues= Foursquare_seg(Nantes_houses,'Nantes')
## Checking the head of the dataframe
print(nantes_venues.shape)
nantes_venues.head()
## Saving venues to a global file
nantes_venues.to_csv(r'_nantes_venues_20191009.csv')
toulouse_venues= Foursquare_seg(Toulouse_houses,'Toulouse')
## Checking the head of the dataframe
print(toulouse_venues.shape)
toulouse_venues.head()
## Saving venues to a global file
toulouse_venues.to_csv(r'_toulouse_venues_20191009.csv')
bordeaux_venues= Foursquare_seg(Bordeaux_houses,'Bordeaux')
## Checking the head of the dataframe
print(bordeaux_venues.shape)
bordeaux_venues.head()
## Saving venues to a global file
bordeaux_venues.to_csv(r'_bordeaux_venues_20191009.csv')
Reading the data back from the saved CSV files (built with the Foursquare API); this avoids the long execution times required to fetch the data from Foursquare again.
import pandas as pd
toulouse_venues=pd.read_csv('_toulouse_venues_20191009.csv')
toulouse_venues=toulouse_venues.rename(columns={'Unnamed: 0': 'City'})
toulouse_venues['City']='TOULOUSE'
toulouse_venues.head()
nantes_venues=pd.read_csv('_nantes_venues_20191009.csv')
nantes_venues=nantes_venues.rename(columns={'Unnamed: 0': 'City'})
nantes_venues['City']='NANTES'
nantes_venues.head()
bordeaux_venues=pd.read_csv('_bordeaux_venues_20191009.csv', index_col=False)
bordeaux_venues=bordeaux_venues.rename(columns={'Unnamed: 0': 'City'})
bordeaux_venues['City']='BORDEAUX'
bordeaux_venues.head()
Let's check how many venues were returned for Toulouse
toulouse_venues.groupby('Neighborhood').count()
Let's find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories for Toulouse.'.format(len(toulouse_venues['Venue Category'].unique())))
Let's check how many venues were returned for Nantes
nantes_venues.groupby('Neighborhood').count()
Let's find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories for Nantes.'.format(len(nantes_venues['Venue Category'].unique())))
Let's check how many venues were returned for Bordeaux
bordeaux_venues.groupby('Neighborhood').count()
print('There are {} unique categories for Bordeaux.'.format(len(bordeaux_venues['Venue Category'].unique())))
Creating a dataframe containing the venues of all cities. The objective of this approach is to obtain feature vectors of identical size for each city, which will in the end allow cluster comparison between the 3 cities considered in this project.
global_venues=pd.concat([toulouse_venues, nantes_venues, bordeaux_venues])
global_venues.head()
Performing one-hot encoding on this dataframe.
from sklearn import preprocessing
# one hot encoding
venues_onehot = pd.get_dummies(global_venues[['Venue Category']], prefix="", prefix_sep="")
venues_onehot.head()
# add the city and neighborhood columns back to the dataframe
venues_onehot['City'] = global_venues['City']
venues_onehot['Neighborhood'] = global_venues['Neighborhood']
# move the last column to the front, twice: the first pass moves Neighborhood,
# the second moves City, leaving the order ['City', 'Neighborhood', ...categories]
fixed_columns = [venues_onehot.columns[-1]] + list(venues_onehot.columns[:-1])
venues_onehot = venues_onehot[fixed_columns]
fixed_columns = [venues_onehot.columns[-1]] + list(venues_onehot.columns[:-1])
venues_onehot = venues_onehot[fixed_columns]
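The double reorder trick above is worth a closer look, since each pass only moves the last column to the front. A toy illustration with hypothetical category columns:

```python
import pandas as pd

df = pd.DataFrame({'Bar': [1], 'Cafe': [0], 'City': ['X'], 'Neighborhood': ['Y']})
# Each pass moves the last column to the front:
# pass 1 moves Neighborhood, pass 2 moves City
for _ in range(2):
    df = df[[df.columns[-1]] + list(df.columns[:-1])]
print(list(df.columns))
```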
Splitting this dataframe by city.
toulouse_onehot=venues_onehot[venues_onehot['City']=='TOULOUSE']
nantes_onehot=venues_onehot[venues_onehot['City']=='NANTES']
bordeaux_onehot=venues_onehot[venues_onehot['City']=='BORDEAUX']
First, let's write a function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
This function performs a deeper exploration of each city. Arguments: df_venues = the one-hot-encoded venues dataframe for the city; df_houses = the dataframe containing the segmented house prices.
def venues_analysis(df_venues, df_houses):
    ## The venues dataframe is already one-hot encoded; only the City column must be dropped
    df_onehot = df_venues.drop(['City'], axis=1)
    # Group rows by neighborhood, taking the mean frequency of occurrence of each category
    df_grouped = df_onehot.groupby('Neighborhood').mean().reset_index()
    ## Add the (down-weighted) house price categories as extra clustering features
    df_houses_tmp = df_houses[['Voie', 'Houses_L', 'Houses_ML', 'Houses_M', 'Houses_MH', 'Houses_H']].copy()
    for col in ['Houses_L', 'Houses_ML', 'Houses_M', 'Houses_MH', 'Houses_H']:
        df_houses_tmp[col] = df_houses_tmp[col] / 5
    df_grouped_tmp = df_grouped.join(df_houses_tmp.set_index('Voie'), on='Neighborhood')
    # Now let's create a new dataframe displaying the top 5 venues for each neighborhood
    num_top_venues = 5
    indicators = ['st', 'nd', 'rd']
    # create columns according to the number of top venues
    columns = ['Neighborhood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except IndexError:
            columns.append('{}th Most Common Venue'.format(ind+1))
    # create the new dataframe
    df_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    df_neighborhoods_venues_sorted['Neighborhood'] = df_grouped['Neighborhood']
    for ind in np.arange(df_grouped.shape[0]):
        df_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_grouped.iloc[ind, :], num_top_venues)
    # merge the venues data with the house price data to add latitude/longitude for each neighborhood
    df_merged = df_neighborhoods_venues_sorted.join(df_houses.set_index('Voie'), on='Neighborhood').reset_index(drop=True)
    return df_merged, df_grouped_tmp
# City of Toulouse
toulouse_merged, toulouse_grouped = venues_analysis(toulouse_onehot, Toulouse_houses)
# City of Nantes
nantes_merged, nantes_grouped = venues_analysis(nantes_onehot, Nantes_houses)
# City of Bordeaux
bordeaux_merged, bordeaux_grouped = venues_analysis(bordeaux_onehot, Bordeaux_houses)
Run k-means to cluster the neighborhoods.
# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
The function below runs the k-means algorithm multiple times. It returns the silhouette & inertia scores in order to determine which value of K (number of clusters) is optimal.
def city_Kmean(df_grouped, df_merged, clusterMAX=20):
    score_df = pd.DataFrame(columns=['Sil', 'Inertia', 'K_clust'])
    ### Building the clustering DF - remove the string column, Neighborhood (Voie)
    df_grouped_clustering = df_grouped.drop('Neighborhood', axis=1)
    for kclusters in range(2, clusterMAX):
        # run k-means clustering
        kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_grouped_clustering)
        sil_score = silhouette_score(df_grouped_clustering, kmeans.labels_)
        print('Silhouette score : ', sil_score)
        print('Inertia : ', kmeans.inertia_)
        print('Kclusters : ', kclusters)
        score_df = pd.concat([score_df,
                              pd.DataFrame([{'Sil': sil_score, 'Inertia': kmeans.inertia_, 'K_clust': kclusters}])],
                             ignore_index=True)
    return score_df
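As a sanity check of the scoring loop, the same silhouette computation can be run on synthetic, well-separated blobs, where the true number of clusters should score highest (the blob parameters below are arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [0, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k)  # the true number of blobs
```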
Once the number of clusters has been determined, the function below is called to return a dataframe containing the cluster label of each neighborhood.
def city_Kmean_Neigh(df_grouped, df_merged, kclusters=10):
    ### Building the clustering DF - remove the string column, Neighborhood (Voie)
    df_grouped_clustering = df_grouped.drop('Neighborhood', axis=1)
    df_grouped_tmp = df_grouped.copy()
    # run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_grouped_clustering)
    # add the clustering labels (one per row of the dataframe)
    df_grouped_tmp.insert(0, 'Cluster Labels', kmeans.labels_)
    # get the centroid positions
    df_centroids = kmeans.cluster_centers_
    # merge df_grouped with the houses data to add the cluster label for each neighborhood
    df_merged_cluster = df_merged.join(df_grouped_tmp[['Neighborhood', 'Cluster Labels']].set_index('Neighborhood'), on='Neighborhood')
    return df_merged_cluster, df_centroids
toulouse_score_df=city_Kmean(toulouse_grouped, toulouse_merged)
toulouse_score_df.plot(y='Inertia', use_index=True)
toulouse_score_df.plot(y='Sil', use_index=True)
The optimum value of K appears to be 12. Let's call k-means with this value and build a dataframe containing cluster labels, venues & house prices.
toulouse_merged_cluster, toulouse_centroids = city_Kmean_Neigh (toulouse_grouped, toulouse_merged, kclusters = 12)
nantes_score_df=city_Kmean(nantes_grouped, nantes_merged, clusterMAX = 20)
nantes_score_df.plot(y='Inertia', use_index=True)
nantes_score_df.plot(y='Sil', use_index=True)
The optimum value of K appears to be 6. Let's call k-means with this value and build a dataframe containing cluster labels, venues & house prices.
nantes_merged_cluster, nantes_centroids = city_Kmean_Neigh (nantes_grouped, nantes_merged, kclusters = 6)
bordeaux_score_df=city_Kmean(bordeaux_grouped, bordeaux_merged, clusterMAX = 20)
bordeaux_score_df.plot(y='Inertia', use_index=True)
bordeaux_score_df.plot(y='Sil', use_index=True)
The optimum value of K appears to be 3. Let's call k-means with this value and build a dataframe containing cluster labels, venues & house prices.
bordeaux_merged_cluster , bordeaux_centroids = city_Kmean_Neigh (bordeaux_grouped, bordeaux_merged, kclusters = 3)
Let's visualize the resulting clusters. First, let's build a function for that.
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
def city_cluster_viz(kclusters, df_merged_cluster, city_name):
    ## First, create a geolocator object
    geolocator = Nominatim(user_agent="explorer")
    location = geolocator.geocode(city_name)
    latitude = location.latitude
    longitude = location.longitude
    print('The geographical coordinates of the city are {}, {}.'.format(latitude, longitude))
    # create the map
    map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
    # set the color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]
    # add markers to the map
    for lat, lon, poi, cluster in zip(df_merged_cluster['Latitude'], df_merged_cluster['Longitude'], df_merged_cluster['Neighborhood'], df_merged_cluster['Cluster Labels'].astype(int)):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)
    display(map_clusters)
    return
city_cluster_viz(12, toulouse_merged_cluster, 'TOULOUSE')
city_cluster_viz(6, nantes_merged_cluster, 'NANTES')
city_cluster_viz(3, bordeaux_merged_cluster, 'BORDEAUX')
It appears that the spatial distribution of clusters differs from one city to another. While some cluster areas are quite dense for Toulouse and Nantes (e.g. the dense red area close to the city centre of Nantes), the clusters of Bordeaux are closer to a homogeneous distribution of single points over the city map.
It seemed interesting to determine, for each city, the number of neighborhoods (one neighborhood = one city street) contained in each cluster. This distribution is illustrated by the horizontal bar plots below:
ax = toulouse_merged_cluster.groupby('Cluster Labels').count()[['Neighborhood']].plot.barh(alpha=0.5)
ax = nantes_merged_cluster.groupby('Cluster Labels').count()[['Neighborhood']].plot.barh(alpha=0.5)
ax = bordeaux_merged_cluster.groupby('Cluster Labels').count()[['Neighborhood']].plot.barh(alpha=0.5)
To compare the city clusters with each other, we will measure the distance between their centroids, using the Euclidean distance as the metric. For this, we import euclidean_distances from sklearn.
from sklearn.metrics.pairwise import euclidean_distances
Now let's measure the Euclidean distance between the clusters of each pair of cities: Bordeaux-Nantes, Bordeaux-Toulouse and Nantes-Toulouse. The distances will be stored in numpy arrays.
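euclidean_distances takes two arrays of shape (m, d) and (n, d) and returns an (m, n) matrix in which entry (i, j) is the distance between row i of the first array and row j of the second. A minimal check with made-up "centroids":

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

a = np.array([[0.0, 0.0], [3.0, 4.0]])               # two centroids
b = np.array([[0.0, 0.0], [6.0, 8.0], [0.0, 1.0]])   # three centroids
d = euclidean_distances(a, b)
print(d.shape)   # (2, 3): one row per centroid of a, one column per centroid of b
print(d[1, 0])   # distance between [3, 4] and the origin (a 3-4-5 triangle)
```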
dist_bordeaux_nantes=euclidean_distances(bordeaux_centroids, nantes_centroids)
dist_bordeaux_toulouse=euclidean_distances(bordeaux_centroids, toulouse_centroids)
dist_nantes_toulouse=euclidean_distances(nantes_centroids, toulouse_centroids)
Let's use bar plots to visualize the inter-centroid distances.
# First, import Matplotlib
import matplotlib.pyplot as plt
Let's visualize distances between the clusters of Bordeaux and the clusters of Nantes
plt.barh(np.arange(len(dist_bordeaux_nantes[0,:])),dist_bordeaux_nantes[0,:])
plt.title('Distance between Cluster 0 of Bordeaux & Clusters of Nantes')
plt.barh(np.arange(len(dist_bordeaux_nantes[1,:])),dist_bordeaux_nantes[1,:])
plt.title('Distance between Cluster 1 of Bordeaux & Clusters of Nantes')
plt.barh(np.arange(len(dist_bordeaux_nantes[2,:])),dist_bordeaux_nantes[2,:])
plt.title('Distance between Cluster 2 of Bordeaux & Clusters of Nantes')
Let's visualize distances between the clusters of Bordeaux and the clusters of Toulouse
plt.barh(np.arange(len(dist_bordeaux_toulouse[0,:])),dist_bordeaux_toulouse[0,:])
plt.title('Distance between Cluster 0 of Bordeaux & Clusters of Toulouse')
plt.barh(np.arange(len(dist_bordeaux_toulouse[1,:])),dist_bordeaux_toulouse[1,:])
plt.title('Distance between Cluster 1 of Bordeaux & Clusters of Toulouse')
plt.barh(np.arange(len(dist_bordeaux_toulouse[2,:])),dist_bordeaux_toulouse[2,:])
plt.title('Distance between Cluster 2 of Bordeaux & Clusters of Toulouse')
Let's visualize distances between the clusters of Nantes and the clusters of Toulouse
plt.barh(np.arange(len(dist_nantes_toulouse[0,:])),dist_nantes_toulouse[0,:])
plt.title('Distance between Cluster 0 of Nantes & Clusters of Toulouse')
plt.barh(np.arange(len(dist_nantes_toulouse[1,:])),dist_nantes_toulouse[1,:])
plt.title('Distance between Cluster 1 of Nantes & Clusters of Toulouse')
plt.barh(np.arange(len(dist_nantes_toulouse[2,:])),dist_nantes_toulouse[2,:])
plt.title('Distance between Cluster 2 of Nantes & Clusters of Toulouse')
plt.barh(np.arange(len(dist_nantes_toulouse[3,:])),dist_nantes_toulouse[3,:])
plt.title('Distance between Cluster 3 of Nantes & Clusters of Toulouse')
plt.barh(np.arange(len(dist_nantes_toulouse[4,:])),dist_nantes_toulouse[4,:])
plt.title('Distance between Cluster 4 of Nantes & Clusters of Toulouse')
plt.barh(np.arange(len(dist_nantes_toulouse[5,:])),dist_nantes_toulouse[5,:])
plt.title('Distance between Cluster 5 of Nantes & Clusters of Toulouse')
Now let's find the minimum distances in order to focus on the clusters that are closest to each other.
## Bordeaux to Nantes minimum distance
print('Minimum distance between Bordeaux and Nantes', np.amin(dist_bordeaux_nantes))
print('Index position of the smallest distance: ',np.argmin(dist_bordeaux_nantes))
print('Distance between clusters of Bordeaux and Nantes arrays')
print(dist_bordeaux_nantes)
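Note that np.argmin returns a flat index into the distance matrix; np.unravel_index converts it into the (row, column) pair identifying the two closest clusters. A quick sketch on a made-up distance matrix:

```python
import numpy as np

dist = np.array([[3.0, 1.2],
                 [0.4, 2.5]])  # hypothetical inter-centroid distances
flat = np.argmin(dist)                        # flat position of the minimum
row, col = np.unravel_index(flat, dist.shape) # cluster of city 1, cluster of city 2
print(row, col)
```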
Cluster 2 of Bordeaux and Cluster 0 of Nantes are the closest. Note that Clusters 0 of Bordeaux and Nantes are also close to each other
## Bordeaux to Toulouse minimum distance
print('Minimum distance between Bordeaux and Toulouse', np.amin(dist_bordeaux_toulouse))
print('Index position of the smallest distance: ',np.argmin(dist_bordeaux_toulouse))
print('Distance between clusters of Bordeaux and Toulouse arrays')
print(dist_bordeaux_toulouse)
Cluster 2 of Bordeaux and Cluster 5 of Toulouse are the closest. Note that cluster 1 of Bordeaux and Cluster 5 of Toulouse are also close to each other
## Nantes to Toulouse minimum distance
print('Minimum distance between Nantes and Toulouse', np.amin(dist_nantes_toulouse))
print('Index position of the smallest distance: ',np.argmin(dist_nantes_toulouse))
print('Distance between clusters of Nantes and Toulouse arrays')
print(dist_nantes_toulouse)
Cluster 0 of Nantes and Cluster 5 of Toulouse are the closest. All other cluster centroids are far apart.
Let's analyse the content of similar clusters: Bordeaux cluster 0 and Nantes cluster 0.
## Let's visualize content of Bordeaux cluster 0
bordeaux_merged_cluster.loc[bordeaux_merged_cluster['Cluster Labels'] == 0, bordeaux_merged_cluster.columns[[1] + list(range(2, bordeaux_merged_cluster.shape[1]))]]
## Let's visualize content of Nantes cluster 0
nantes_merged_cluster.loc[nantes_merged_cluster['Cluster Labels'] == 0, nantes_merged_cluster.columns[[1] + list(range(2, nantes_merged_cluster.shape[1]))]]
The similarities are not easy to spot by eye, which is exactly why distance metrics are useful!
Let's check the content of Toulouse cluster 5, which is supposed to be close to cluster 0 of Nantes.
toulouse_merged_cluster.loc[toulouse_merged_cluster['Cluster Labels'] == 5, toulouse_merged_cluster.columns[[1] + list(range(2, toulouse_merged_cluster.shape[1]))]]
Well, to be honest, it's not easy to identify similarities at first sight. Looking at the venues and house prices may help define this cluster as middle-class residential areas (house prices being mainly Medium or Medium High).
As presumed, this project shows that similar patterns can be identified between French mid-size cities thanks to the k-means clustering algorithm. The large number of columns (over 300) characterizing the clusters, combined with the large number of samples (rows) linked to each cluster, shows that inter-cluster comparison cannot be performed by eye alone.
The use of mathematical metrics therefore appears necessary to obtain an objective measure of similarity between clusters. In our case, the Euclidean distance between cluster centroids was used as the main indicator of similarity.
The methodology applied in this project can easily be extended to identify similarities between other major French cities.
As a next step, a deeper focus on each cluster could help assign a label to each of them (e.g. "residential area", "shopping area", etc.), with the objective, in the end, of obtaining a more explicit vision of the inner structure of the cities.
Running this project was really exciting for me, as it offered real opportunities to apply what I learned during this class to real problems and real data.
This project also provides valuable input to investors, as it highlights that similarities exist between cities. This could help them better target their business and benefit from economies of scale by running similar projects in similar city clusters.